<HTML>

<HEAD>
<TITLE>Learning Algorithms for Keyphrase Extraction</TITLE>
</HEAD>

<BODY BGCOLOR="#ffffff">

<MAP NAME="banner_top">
<AREA SHAPE="rect" COORDS="588,14,620,40" HREF="http://www.iit.nrc.ca/english.html">
<AREA SHAPE="rect" COORDS="538,14,583,37" HREF="http://www.corpserv.nrc.ca/corpserv/nrc.html">
<AREA SHAPE="rect" COORDS="86,4,421,37" HREF="http://www.iit.nrc.ca/II_public/index.html">
</MAP>

<table cellpadding="5" cellspacing="0" border="0" width="100%">
<tr><td valign="bottom" align="left">
<IMG SRC="../banner_top.jpg" width="620" height="37" alt="II Group Banner" 
USEMAP="#banner_top" ISMAP border="0"><BR><IMG SRC="../banner_extractor.jpg" 
width="217" height="49" alt="Extractor">
</td></tr>
</table>

<H2><FONT COLOR="#400080">Learning Algorithms for Keyphrase Extraction</FONT></H2>

<HR>

<TABLE>
<TR>
<TD>
<B>Format</B>
</TD>
<TD>
<B>File</B>
</TD>
<TD ALIGN=RIGHT>
<B>Bytes</B>
</TD>
<TD>
<B>Viewer</B>
</TD>
</TR>

<TR>
<TD>
PDF
</TD>
<TD>
<A HREF=IR2000.pdf><CODE>IR2000.pdf</CODE></A>
</TD>
<TD ALIGN=RIGHT>
274,281
</TD>
<TD>
<A HREF="http://www.adobe.com/prodindex/acrobat/readstep.html">free PDF viewer</A>
</TD>
</TR>

<TR>
<TD>
PostScript
</TD>
<TD>
<A HREF=IR2000.ps><CODE>IR2000.ps</CODE></A>
</TD>
<TD ALIGN=RIGHT>
3,540,317
</TD>
<TD>
<A HREF="http://www.cs.wisc.edu/~ghost/">free PostScript viewer</A>
</TD>
</TR>

<TR>
<TD>
Compressed PostScript
</TD>
<TD>
<A HREF=IR2000.ps.Z><CODE>IR2000.ps.Z</CODE></A>
</TD>
<TD ALIGN=RIGHT>
817,384
</TD>
<TD>
<A HREF="http://www.cs.wisc.edu/~ghost/">free PostScript viewer</A>
</TD>
</TR>
</TABLE>

<HR>

<DL>
<DT> Turney, P.D. (2000).
<DD> Learning algorithms for keyphrase extraction. 
<DD> <A HREF="http://www.wkap.nl/journalhome.htm/1386-4564">
<I>Information Retrieval</I></A>, 2 (4): 303-336.
</DL>

<HR>

<H2>Abstract</H2>

Many academic journals ask their authors to provide a list of about 
five to fifteen keywords, to appear on the first page of each article. 
Since these keywords are often phrases of two or more words, we prefer 
to call them keyphrases. There is a wide variety of tasks for which 
keyphrases are useful, as we discuss in this paper. We approach the 
problem of automatically extracting keyphrases from text as a supervised 
learning task. We treat a document as a set of phrases, which the 
learning algorithm must learn to classify as positive or negative 
examples of keyphrases. Our first set of experiments applies the C4.5 
decision tree induction algorithm to this learning task. We evaluate 
the performance of nine different configurations of C4.5. The second 
set of experiments applies the GenEx algorithm to the task. We developed 
the GenEx algorithm specifically for automatically extracting keyphrases 
from text. The experimental results support the claim that a custom-designed 
algorithm (GenEx), incorporating specialized procedural domain knowledge, 
can generate better keyphrases than a general-purpose algorithm (C4.5). 
Subjective human evaluation of the keyphrases generated by GenEx 
suggests that about 80% of the keyphrases are acceptable to human readers. 
This level of performance should be satisfactory for a wide variety of 
applications.

<HR>

<CENTER>
<table border="1" bgcolor="#ccccff">
<tr><td><font size=2>
[ <a href="http://extractor.iit.nrc.ca/">Extractor Home</a> |
<A HREF="http://www.iit.nrc.ca/II_public/french.html">Fran&ccedil;ais</a> |
<A HREF="http://www.iit.nrc.ca/english.html">IIT</A> |
<A HREF="http://www.iit.nrc.ca/II_public/index.html">II Group</A> |
<A HREF="http://www.corpserv.nrc.ca/corpserv/nrc.html">NRC</A> |
<A HREF="http://ai.iit.nrc.ca/search.html">Search</A> |
<A HREF="mailto:Peter.Turney@iit.nrc.ca">Feedback</A> ]
[ <I>Updated</I>: August 28, 2000 ]</font></td></tr>
</table>
</CENTER>

</BODY>
</HTML>

